Collaborating Authors

 Chester


Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Kim, Hyunjae, Sohn, Jiwoong, Gilson, Aidan, Cochran-Caggiano, Nicholas, Applebaum, Serina, Jin, Heeju, Park, Seihee, Park, Yujin, Park, Jiyeong, Choi, Seoyoung, Contreras, Brittany Alexandra Herrera, Huang, Thomas, Yun, Jaehoon, Wei, Ethan F., Jiang, Roy, Colucci, Leah, Lai, Eric, Dave, Amisha, Guo, Tuo, Singer, Maxwell B., Koo, Yonghoe, Adelman, Ron A., Zou, James, Taylor, Andrew, Cohan, Arman, Xu, Hua, Chen, Qingyu

arXiv.org Artificial Intelligence

Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
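The evidence-selection figures above (precision and recall) can be read as set overlap between the passages a response actually uses and the passages experts judged relevant. A minimal sketch under that assumption follows; the function name and passage identifiers are hypothetical and not taken from the paper.

```python
def selection_precision_recall(used_ids, relevant_ids):
    """Return (precision, recall) of used passages against relevant ones.

    Assumption: both arguments are collections of passage identifiers;
    the paper's exact annotation scheme may differ.
    """
    used = set(used_ids)
    relevant = set(relevant_ids)
    hits = len(used & relevant)
    # Guard against empty sets so the ratios stay defined.
    precision = hits / len(used) if used else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: the response drew on four passages,
# and experts marked three passages as relevant.
precision, recall = selection_precision_recall(
    ["p1", "p4", "p7", "p9"],
    ["p1", "p2", "p4"],
)
```

Low precision means the model leans on irrelevant passages; low recall means it ignores relevant ones, which is why the two are reported separately.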


Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges

Matotek, Kadin, Cassel, Heather, Amiruzzaman, Md, Ngo, Linh B.

arXiv.org Artificial Intelligence

This study examines the performance of today's open-source, locally hosted large language models (LLMs) in handling complex competitive programming tasks with extended problem descriptions and contexts. Building on the original Framework for AI-driven Code Generation Evaluation (FACE), the authors retrofit the pipeline to work entirely offline through the Ollama runtime, collapsing FACE's sprawling per-problem directory tree into a handful of consolidated JSON files and adding robust checkpointing so multi-day runs can resume after failures. The enhanced framework generates, submits, and records solutions for the full Kattis corpus of 3,589 problems across eight code-oriented models ranging from 6.7 to 9 billion parameters. The submission results show that the overall pass@1 accuracy is modest for the local models, with the best models performing at approximately half the acceptance rate of the proprietary models, Gemini 1.5 and ChatGPT-4. These findings expose a persistent gap between private, cost-controlled LLM deployments and state-of-the-art proprietary services, yet also highlight the rapid progress of open models and the practical benefits of an evaluation workflow that organizations can replicate on in-house hardware.
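The checkpointing idea described above (a consolidated JSON file that lets a long run resume after a failure) might look roughly like the sketch below. It is an illustration only: the function names, the file layout, and the `fake_solve` stand-in are hypothetical and do not reproduce the FACE code.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Load previously recorded results, or start empty."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    return {}


def save_checkpoint(path, results):
    """Write to a temp file, then rename, so a crash cannot truncate the file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle)
    os.replace(tmp_path, path)


def run_with_checkpoint(problem_ids, solve, path):
    """Solve each problem once; skip ids already present in the checkpoint."""
    results = load_checkpoint(path)
    for problem_id in problem_ids:  # ids must be strings (JSON object keys)
        if problem_id in results:
            continue
        results[problem_id] = solve(problem_id)
        save_checkpoint(path, results)
    return results


# Hypothetical stand-in for "generate a solution and submit it".
calls = []

def fake_solve(problem_id):
    calls.append(problem_id)
    return "wrong answer" if problem_id == "b" else "accepted"


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = os.path.join(workdir, "results.json")
    first = run_with_checkpoint(["a", "b"], fake_solve, checkpoint)
    # A second run resumes: only the new problem "c" is solved.
    second = run_with_checkpoint(["a", "b", "c"], fake_solve, checkpoint)
```

Saving after every problem trades some I/O for the guarantee that at most one problem's work is lost on a crash.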


The study of short texts in digital politics: Document aggregation for topic modeling

Nakka, Nitheesha, Yalcin, Omer F., Desmarais, Bruce A., Rajtmajer, Sarah, Monroe, Burt

arXiv.org Artificial Intelligence

Statistical topic modeling is widely used in political science to study text. Researchers examine documents of varying lengths, from tweets to speeches. There is ongoing debate on how document length affects the interpretability of topic models. We investigate the effects of aggregating short documents into larger ones based on natural units that partition the corpus. In our study, we analyze one million tweets by U.S. state legislators from April 2016 to September 2020. We find that for documents aggregated at the account level, topics are more associated with individual states than when using individual tweets. This finding is replicated with Wikipedia pages aggregated by birth cities, showing how document definitions can impact topic modeling results.
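Aggregating short documents by a natural unit, as described above, amounts to concatenating all texts that share a unit key before fitting a topic model. A minimal sketch, assuming each record is a dictionary with hypothetical `account` and `text` fields (the study's actual data schema is not given here):

```python
from collections import defaultdict


def aggregate_by_unit(records, unit_key="account", text_key="text"):
    """Concatenate short texts that share the same unit into one document."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[unit_key]].append(record[text_key])
    # One aggregated document per unit, in first-seen order.
    return {unit: " ".join(texts) for unit, texts in grouped.items()}


# Hypothetical tweets from two legislator accounts.
tweets = [
    {"account": "rep_a", "text": "Budget vote today."},
    {"account": "rep_b", "text": "Visiting local schools."},
    {"account": "rep_a", "text": "Town hall on Friday."},
]
documents = aggregate_by_unit(tweets)
```

The resulting dictionary values would then be passed to the topic model in place of the individual tweets.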


Neural network interpretability with layer-wise relevance propagation: novel techniques for neuron selection and visualization

Bhati, Deepshikha, Neha, Fnu, Amiruzzaman, Md, Guercio, Angela, Shukla, Deepak Kumar, Ward, Ben

arXiv.org Artificial Intelligence

Interpreting complex neural networks is crucial for understanding their decision-making processes, particularly in applications where transparency and accountability are essential. The proposed method addresses this need by focusing on Layer-wise Relevance Propagation (LRP), a technique used in explainable artificial intelligence (XAI) to attribute neural network outputs to input features through backpropagated relevance scores. Existing LRP methods often struggle with precision in evaluating individual neuron contributions. To overcome this limitation, we present a novel approach that improves the parsing of selected neurons during LRP backward propagation, using the Visual Geometry Group 16 (VGG16) architecture as a case study. Our method creates neural network graphs to highlight critical paths and visualizes these paths with heatmaps, optimizing neuron selection through error metrics such as Mean Squared Error (MSE) and Symmetric Mean Absolute Percentage Error (SMAPE). Additionally, we utilize a deconvolutional visualization technique to reconstruct feature maps, offering a comprehensive view of the network's inner workings. Extensive experiments demonstrate that our approach enhances interpretability and supports the development of more transparent artificial intelligence (AI) systems for computer vision applications. This advancement has the potential to improve the trustworthiness of AI models in real-world machine vision applications, thereby increasing their reliability and effectiveness.
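For reference, the two metrics named above can be computed as in the sketch below. SMAPE has several variants in the literature; this one uses the mean of 2|a - p| / (|a| + |p|) expressed as a percentage, which is an assumption, since the abstract does not say which variant the authors use.

```python
def mse(actual, predicted):
    """Mean Squared Error over two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error, in percent (0 to 200).

    Assumed variant: mean of 2|a - p| / (|a| + |p|) * 100.
    Terms where both values are zero contribute zero error.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denominator = abs(a) + abs(p)
        if denominator:
            total += 2.0 * abs(a - p) / denominator
    return 100.0 * total / len(actual)


# Hypothetical values comparing a reference output with a reconstruction.
mse_value = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
smape_value = smape([100.0, 200.0], [110.0, 200.0])
```

MSE is scale-dependent, while SMAPE is a bounded relative error, so the two give complementary views of how far a selected-neuron output drifts from the reference.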


A Tiered GAN Approach for Monet-Style Image Generation

Neha, FNU, Bhati, Deepshikha, Shukla, Deepak Kumar, Amiruzzaman, Md

arXiv.org Artificial Intelligence

Generative Adversarial Networks (GANs) have proven to be a powerful tool in generating artistic images, capable of mimicking the styles of renowned painters, such as Claude Monet. This paper introduces a tiered GAN model to progressively refine image quality through a multi-stage process, enhancing the generated images at each step. The model transforms random noise into detailed artistic representations, addressing common challenges such as instability in training, mode collapse, and output quality. This approach combines downsampling and convolutional techniques, enabling the generation of high-quality Monet-style artwork while optimizing computational efficiency. Experimental results demonstrate the architecture's ability to produce foundational artistic structures, though further refinements are necessary for achieving higher levels of realism and fidelity to Monet's style. Future work focuses on improving training methodologies and model complexity to bridge the gap between generated and true artistic images. Additionally, the limitations of traditional GANs in artistic generation are analyzed, and strategies to overcome these shortcomings are proposed.


From classical techniques to convolution-based models: A review of object detection algorithms

Neha, Fnu, Bhati, Deepshikha, Shukla, Deepak Kumar, Amiruzzaman, Md

arXiv.org Artificial Intelligence

Object detection is a fundamental task in computer vision and image understanding, with the goal of identifying and localizing objects of interest within an image while assigning them corresponding class labels. Traditional methods, which relied on handcrafted features and shallow models, struggled with complex visual data and showed limited performance. These methods combined low-level features with contextual information and lacked the ability to capture high-level semantics. Deep learning, especially Convolutional Neural Networks (CNNs), addressed these limitations by automatically learning rich, hierarchical features directly from data. These features include both semantic and high-level representations essential for accurate object detection. This paper reviews object detection frameworks, starting with classical computer vision methods. We categorize object detection approaches into two groups: (1) classical computer vision techniques and (2) CNN-based detectors. We compare major CNN models, discussing their strengths and limitations. In conclusion, this review highlights the significant advancements in object detection through deep learning and identifies key areas for further research to improve performance.


Classified as unknown: A novel Bayesian neural network

Yang, Tianbo, Yang, Tianshuo

arXiv.org Artificial Intelligence

We establish estimations for the parameters of the output distribution for the softmax activation function using the probit function. As an application, we develop a new efficient Bayesian learning algorithm for fully connected neural networks, where training and predictions are performed within the Bayesian inference framework in closed form. This approach allows sequential learning and requires no computationally expensive gradient calculation or Monte Carlo sampling. Our work generalizes the Bayesian algorithm for a single perceptron for binary classification in \cite{H} to multi-layer perceptrons for multi-class classification.


Distributed Swarm Intelligence

Kanjula, Karthik Reddy, Kolla, Sai Meghana

arXiv.org Artificial Intelligence

This paper presents the development of a distributed application that facilitates the understanding and application of swarm intelligence in solving optimization problems. The platform comprises a search space of customizable random particles, allowing users to tailor the solution to their specific needs. By leveraging the power of Ray distributed computing, the application can support multiple users simultaneously, offering a flexible and scalable solution. The primary objective of this project is to provide a user-friendly platform that enhances the understanding and practical use of swarm intelligence in problem-solving.
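To make the swarm-intelligence idea above concrete, here is a minimal single-process particle swarm optimization sketch. It is not the authors' application: it omits Ray and any multi-user features, and the objective (`sphere`), parameter values, and function names are illustrative assumptions.

```python
import random


def sphere(position):
    """Simple test objective with its minimum (0) at the origin."""
    return sum(x * x for x in position)


def particle_swarm(objective, dim=2, n_particles=20, iterations=100,
                   bounds=(-5.0, 5.0), inertia=0.7, cognitive=1.5,
                   social=1.5, seed=0):
    """Minimize `objective` with a basic global-best particle swarm."""
    rng = random.Random(seed)
    low, high = bounds
    positions = [[rng.uniform(low, high) for _ in range(dim)]
                 for _ in range(n_particles)]
    velocities = [[0.0] * dim for _ in range(n_particles)]
    personal_best = [p[:] for p in positions]
    personal_score = [objective(p) for p in positions]
    best_index = min(range(n_particles), key=personal_score.__getitem__)
    global_best = personal_best[best_index][:]
    global_score = personal_score[best_index]

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity blends momentum, pull toward the particle's own
                # best position, and pull toward the swarm's best position.
                velocities[i][d] = (
                    inertia * velocities[i][d]
                    + cognitive * r1 * (personal_best[i][d] - positions[i][d])
                    + social * r2 * (global_best[d] - positions[i][d])
                )
                # Move and clamp to the search bounds.
                positions[i][d] = min(high, max(low, positions[i][d] + velocities[i][d]))
            score = objective(positions[i])
            if score < personal_score[i]:
                personal_best[i] = positions[i][:]
                personal_score[i] = score
                if score < global_score:
                    global_best = positions[i][:]
                    global_score = score
    return global_best, global_score


best_position, best_score = particle_swarm(sphere)
```

In a distributed setting, the per-particle objective evaluations are the natural unit to parallelize, since each particle's score can be computed independently within an iteration.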


Light in the Larynx: a Miniaturized Robotic Optical Fiber for In-office Laser Surgery of the Vocal Folds

Chiluisa, Alex J., Pacheco, Nicholas E., Do, Hoang S., Tougas, Ryan M., Minch, Emily V., Mihaleva, Rositsa, Shen, Yao, Liu, Yuxiang, Carroll, Thomas L., Fichera, Loris

arXiv.org Artificial Intelligence

This letter reports the design, construction, and experimental validation of a novel hand-held robot for in-office laser surgery of the vocal folds. In-office endoscopic laser surgery is an emerging trend in laryngology: it promises to deliver the same patient outcomes as traditional surgical treatment (i.e., in the operating room), at a fraction of the cost. Unfortunately, office procedures can be challenging to perform; the optical fibers used for laser delivery can only emit light forward in a line-of-sight fashion, which severely limits anatomical access. The robot we present in this letter aims to overcome these challenges. The end effector of the robot is a steerable laser fiber, created through the combination of a thin optical fiber (0.225 mm) with a tendon-actuated Nickel-Titanium notched sheath that provides bending. This device can be seamlessly used with most commercially available endoscopes, as it is sufficiently small (1.1 mm) to pass through a working channel. To control the fiber, we propose a compact actuation unit that can be mounted on top of the endoscope handle, so that, during a procedure, the operating physician can operate both the endoscope and the steerable fiber with a single hand. We report simulation and phantom experiments demonstrating that the proposed device substantially enhances surgical access compared to current clinical fibers.


Behavioral-clinical phenotyping with type 2 diabetes self-monitoring data

Levine, Matthew E., Albers, David J., Burgermaster, Marissa, Davidson, Patricia G., Smaldone, Arlene M., Mamykina, Lena

arXiv.org Machine Learning

Keywords: self-monitoring data, type 2 diabetes, machine learning, phenotyping, precision medicine. Objective: To evaluate unsupervised clustering methods for identifying individual-level behavioral-clinical phenotypes that relate personal biomarkers and behavioral traits in type 2 diabetes (T2DM) self-monitoring data. Materials and Methods: We used hierarchical clustering (HC) to identify groups of meals with similar nutrition and glycemic impact for 6 individuals with T2DM who collected self-monitoring data. We evaluated clusters on: 1) correspondence to gold standards generated by certified diabetes educators (CDEs) for 3 participants; 2) face validity, rated by CDEs; and 3) impact on CDEs' ability to identify patterns for another 3 participants. Results: Gold standard (GS) included 9 patterns across 3 participants. Of these, all 9 were rediscovered using HC: 4 GS patterns were consistent with patterns identified by HC (over 50% of meals in a cluster followed the pattern); another 5 were included as subgroups in broader clusters. After reviewing clusters, CDEs identified patterns that were more consistent with data (70% reduction in contradictions between patterns and participants' records). Discussion: Hierarchical clustering of blood glucose and macronutrient consumption appears suitable for discovering behavioral-clinical phenotypes in T2DM. Most clusters corresponded to gold standard and were rated positively by CDEs for face validity. Cluster visualizations helped CDEs identify more robust patterns in nutrition and glycemic impact, creating new possibilities for visual analytic solutions. Conclusion: Machine learning methods can use diabetes self-monitoring data to create personalized behavioral-clinical phenotypes, which may prove useful for delivering personalized medicine.